Non-record: Stable Growing Recurrence, Progressive Depth + Error Feedback (1.1163 BPB)#1231
nestamidavaine wants to merge 1 commit into openai:main from
…cord) Progressive depth recurrence (1→2→3 passes) with diagonal error feedback and Jacobian proxy stabilization. Late growth preserves fast step times for most of training, avoiding the step/capacity trade-off that makes naive recurrence impractical under competition constraints.

3-seed mean: 1.1163 bpb / 1.8848 nats (std 0.0013)
Baseline PR openai#549: 1.1194 bpb / 1.8901 nats
Improvement: -0.0031 bpb / -0.0053 nats

8×H100 SXM, 600s wallclock, all artifacts under 16MB decimal.
Recurrent Depth with Progressive Pass Growth + Error Feedback
val_bpb: 1.1163 (3-seed mean, std 0.0013) | ~15.96 MB | 8×H100 SXM
A non-record submission targeting a significant improvement over PR #549 (LeakyReLU² baseline, 1.1194 mean bpb). Achieves -0.0031 bpb vs that baseline. For an in-depth analysis of depth recurrence in this competition, see PR #363. I targeted #549 when I started building this solution; by the time I finished evaluation, a newer improved model had been published to the leaderboard. However, I believe the techniques here can be applied to any model to improve performance, with the largest benefit for submissions using TTT, since the recurrence makes very effective use of the 10 available minutes of evaluation time.
Results (8×H100 80GB SXM, PyTorch 2.9.1+cu128)
We significantly beat the PR #549 LeakyReLU² baseline (1.1194 mean bpb / 1.8901 nats) by -0.0031 bpb / -0.0053 nats across all three seeds (1.1163 mean bpb / 1.8848 nats), achieving the goal we set out with.
Progressive Recurrence Architecture
The Problem: Depth Recurrence Fails Under Competition Constraints
PR #363 demonstrated that depth recurrence — reusing a shared block of transformer layers multiple times — saves parameters but hurts bpb under the 10-minute / 16MB competition constraints. Their controlled experiments showed a +0.025 bpb gap (looped worse) due to two compounding taxes:
Our Solution: Late Growth + Contractive Stabilization
We address both taxes by growing recurrence depth progressively during training and stabilizing the recurrent dynamics.
Progressive Pass Schedule (Late Growth)
The key insight: start training with 1 pass and gradually add passes late in training. This preserves fast step times for the majority of training (83.5ms/step at 1-pass vs ~95ms at 3-pass), maximizing the total number of gradient updates within the 600s wallclock budget. The schedule:
This reduces the step/capacity trade-off that normally makes recurrence impractical under competition constraints. We get ~6,330 training steps (vs ~7,180 for the flat LeakyReLU baseline), but the final model has 17 effective layers at eval vs the baseline's 11.
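As an illustration, the late-growth schedule can be written as a simple step-to-passes mapping. This is a sketch: the thresholds 4500 and 5500 are the ones mentioned in the graph-precompilation section below, and the exact values in the training script may differ.

```python
def num_passes_at(step: int) -> int:
    """Progressive pass schedule: 1 pass for most of training,
    growing to 2 then 3 passes late (thresholds are illustrative)."""
    if step < 4500:
        return 1   # fast phase, ~83.5 ms/step
    elif step < 5500:
        return 2
    else:
        return 3   # ~95 ms/step, 17 effective layers at eval
```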
We also tested training with 4 recurrence passes. While 4-pass shows better per-step loss, the additional step time (~105ms/step) means fewer total steps within the wallclock budget. Under the competition's 600s constraint, 3-pass wins the step/capacity trade-off: the extra training steps from the faster 3-pass schedule outweigh the marginal per-step quality gain from 4 passes.
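A quick back-of-the-envelope check of the step counts, assuming the wallclock budget is dominated by step time (a rough sanity check, not the script's actual accounting):

```python
BUDGET_S = 600.0  # competition wallclock budget

def approx_steps(ms_per_step: float) -> int:
    """Approximate training steps that fit in the 600s budget."""
    return int(BUDGET_S * 1000 / ms_per_step)

# At ~83.5 ms/step the flat baseline lands near the reported ~7,180 steps;
# slower multi-pass phases pull the mixed schedule down toward ~6,330.
```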
Learnable Residual Scaling
Per-pass learnable scalars contract the residual update, preventing hidden state magnitude growth across passes:
$$h_{k+1} = h_k + \alpha_k \cdot f(h_k)$$

where $\alpha_k$ is initialized to 0.5 and learned during training, and $f$ is the shared recurrent block. This ensures the recurrent dynamics are contractive: later passes refine rather than amplify.
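A minimal PyTorch sketch of the scaled-residual recurrence (class and function names here are illustrative, not the PR's actual code):

```python
import torch
import torch.nn as nn

class ScaledRecurrence(nn.Module):
    """Applies a shared block up to max_passes times, scaling each
    residual update by a learnable per-pass scalar alpha_k
    (initialized to 0.5) so later passes refine rather than amplify."""
    def __init__(self, block: nn.Module, max_passes: int = 3):
        super().__init__()
        self.block = block
        self.alphas = nn.Parameter(torch.full((max_passes,), 0.5))

    def forward(self, h: torch.Tensor, num_passes: int) -> torch.Tensor:
        for k in range(num_passes):
            h = h + self.alphas[k] * self.block(h)
        return h
```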
Error Feedback Module
A low-rank correction compensates for accumulated error before each recurrence pass:
$$h_k \leftarrow h_k + \left(U V^\top + \operatorname{diag}(d)\right) e_{k-1}, \qquad k \ge 1$$

where $U, V \in \mathbb{R}^{d \times r}$ with rank $r=2$, $d \in \mathbb{R}^d$ is a learnable diagonal, and $e_{k-1}$ is the previous pass's residual. The correction is zero on pass 0 (no prior error to correct) and active on subsequent passes. Total parameter overhead: 2,560 params (negligible vs 26.7M model params).
The feedback module is important but not strictly required — we confirmed that stable training is possible without it, and even running eval-only without feedback works, at a cost of ~0.001 bpb higher. The feedback module's main contribution is providing the recurrent passes with an error signal about the previous iteration's residual.
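One way to realize the described diagonal-plus-low-rank correction in PyTorch. This is a sketch: the assumed model width of 512 is mine (it makes the parameter count work out to the stated 2,560 with rank 2), and the PR's actual module may differ.

```python
import torch
import torch.nn as nn

class DiagonalErrorFeedback(nn.Module):
    """Low-rank (rank r) plus diagonal correction applied to the
    previous pass's residual. U and d are zero-initialized so the
    correction starts at zero. Illustrative, not the PR's code."""
    def __init__(self, dim: int, rank: int = 2):
        super().__init__()
        self.U = nn.Parameter(torch.zeros(dim, rank))
        self.V = nn.Parameter(torch.randn(dim, rank) * 0.02)
        self.d = nn.Parameter(torch.zeros(dim))

    def forward(self, residual: torch.Tensor) -> torch.Tensor:
        # (U V^T + diag(d)) @ residual, without materializing a d x d matrix
        return (residual @ self.V) @ self.U.T + self.d * residual
```

For dim=512 and rank=2 the parameter count is 512·2 + 512·2 + 512 = 2,560, matching the stated overhead.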
Jacobian Proxy Loss (Stabilizer)
A regularization term penalizes hidden state growth ratio above 1.0, enforcing contractive dynamics without computing the full Jacobian:
$$\mathcal{L}_{\text{jac}} = \lambda \sum_k \max\!\left(0,\ \frac{\lVert h_{k+1} - h_k \rVert}{\lVert h_k - h_{k-1} \rVert} - 1\right)$$

with $\lambda = 0.01$. This is a cheap finite-difference proxy for the spectral norm of the Jacobian $\partial h_{k+1}/\partial h_k$, encouraging it to stay below 1 (a contractive map). The model learns to adhere to this quickly, and it does not seem to affect early training dynamics. However, we saw better results with $\lambda = 0.01$ than with $0.1$, potentially because the restriction at 0.1 is too strong: with only 3× recurrence we don't always need contractive layers, but we do need them not to explode.
This loss term is critical for training stability. Without it, gradient norms and hidden state magnitudes explode during the multi-pass phases, destabilizing training. The proxy loss keeps the recurrent dynamics well-behaved without the computational cost of full Jacobian computation.
Note: the Jacobian proxy loss is only added to the training loss; it does not affect evaluation scoring, which uses pure cross-entropy.
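The proxy loss as described can be sketched in a few lines (an illustrative implementation, not the PR's exact code):

```python
import torch

def jacobian_proxy_loss(h_prev: torch.Tensor,
                        h_curr: torch.Tensor,
                        h_next: torch.Tensor,
                        lam: float = 0.01) -> torch.Tensor:
    """Finite-difference proxy for the recurrent Jacobian's spectral
    norm: penalize the ratio of successive residual norms when it
    exceeds 1, encouraging a contractive map without computing the
    full Jacobian."""
    num = (h_next - h_curr).norm()
    den = (h_curr - h_prev).norm() + 1e-8  # avoid divide-by-zero
    return lam * torch.relu(num / den - 1.0)
```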
Legal TTT Protocol
Score-first legal TTT following PR #461:
torch.inference_mode(): no gradients, no weight mutation

Timing Budget
Architecture
Built on the PR #414 stack with PR #399 Parallel Muon:
Run Command
Key flags:
```
torchrun --standalone --nproc_per_node=8 train_gpt.py \
  --feedback-mode diagonal --feedback-rank 2 \
  --residual-scale-init 0.5 \
  --jacobian-proxy-weight 0.01 \
  --no-interpass-rmsnorm
```

Tricks
Graph Precompilation Warmup
torch.compile is lazy: it only compiles a new graph variant the first time it's encountered. With progressive recurrence (1→2→3 passes) and late QAT, the training loop would hit compilation stalls at step 4500 (2-pass), step 5500 (3-pass), and again when QAT enables. Under a 600s wallclock cap, these stalls are expensive.

The fix: precompile all graph variants during warmup before training starts. During the 20 warmup steps:
- run each num_passes variant (2-pass, 3-pass), and each with QAT toggled on
- force torch.compile to eagerly compile every forward/backward graph that will appear during training

This ensures the training loop runs at full speed from step 0, with no compilation jitter when passes change or QAT kicks in.
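The warmup idea can be sketched generically (names are illustrative; step_fn stands in for the lazily-compiled training step, e.g. a function wrapped in torch.compile):

```python
def precompile_variants(step_fn, batch,
                        pass_variants=(1, 2, 3),
                        qat_variants=(False, True)) -> None:
    """Touch every (num_passes, qat) combination once during warmup,
    so each distinct graph is traced and compiled before the timed
    training loop starts and no stalls occur mid-training."""
    for num_passes in pass_variants:
        for qat in qat_variants:
            step_fn(batch, num_passes, qat)
```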
Code Minification with python-minifier
The original training script was 88,253 bytes, which caused seed 2025 to exceed the 16MB submission limit (16,025,625 bytes). After removing dead code paths (eval-only mode, int8 quantization, unused feedback variants, verbose logging), the file was still too large.
python-minifier with --no-rename-locals shrinks the code aggressively (whitespace, docstrings, constant folding) while preserving local variable names, which is critical because the training script uses string-based lookups for state_dict keys and named_parameters. This brought the file from 68,435 bytes down to 58,186 bytes, comfortably fitting all seeds under the 16MB decimal limit.

Note: The code was minified after all three seed runs completed, so the log files report "Code size: 88253 bytes" and correspondingly larger "Total submission size" values. The actual submission uses the minified 58,186-byte script; the correct per-seed totals are listed in submission.json and the results table above.

Credits
Made with Cursor